{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading columnar data into Tribuo\n",
    "This tutorial demonstrates Tribuo's systems for loading and featurising complex columnar data like csv files, json, and SQL database tables.\n",
    "\n",
    "Tribuo's `Example` objects are tuples of a feature list and an output. The features are tuples of String names, and double values. The outputs differ depending by task. For example, classification tasks use `Label` objects, which denote named members of a finite categorical set, while regression tasks use `Regressor` objects which denote (optionally multiple, optionally named) real values. (Note that this means that Tribuo `Regressor` s support multidimensional regression by default). Unlike standard Machine Learning data formats like svmlight/libsvm or IDX, tabular data needs to be converted into Tribuo's representation before it can be loaded.  In Tribuo this ETL (Extract-Transform-Load) problem is solved by the `org.tribuo.data.columnar` package, and specifically the `RowProcessor` and associated infrastructure.\n",
    "\n",
    "## The RowProcessor\n",
    "Constructing an instance of `RowProcessor<T>` equips it with the configuration information that it needs to accept a `Map<String,String>` representing the keys and values extracted from a single row of data, and emit an `Example<T>`, performing all specified feature and output conversions. In addition to directly processing the values into features or outputs, the `RowProcessor` can extract other fields from the row and write them into the `Example` metadata allowing the storage of timestamps, row id numbers and other information.\n",
    "\n",
    "There are four configurable elements to the `RowProcessor`:\n",
    "- `FieldProcessor` which converts a single cell of a columnar input (read as a String) into potentially many named `Feature` objects.\n",
    "- `FeatureProcessor` which processes a list of `Feature` objects (e.g., to remove duplicates, add conjunctions or replace features with other features).\n",
    "- `ResponseProcessor` which converts a single cell of a columnar input (read as a String) into an `Output` subclass by passing it to the specified `OutputFactory`.\n",
    "- `FieldExtractor` which extracts a single metadata field while processing the whole row at once.\n",
    "\n",
    "Each of these is an interface which has multiple implementations available in Tribuo, and we expect users to write custom implementations for specific processing tasks.\n",
    "\n",
    "The `RowProcessor` can be used at both training and inference time. If the output isn't available in the `Example` because it's constructed at inference/deployment time then the returned `Example` will contain a sentinel \"Unknown\" output of the appropriate type. Thanks to Tribuo's provenance system, the `RowProcessor` configuration is stored inside the `Model` object and so can be reconstructed at inference time with just the model, no other code is required.\n",
    "\n",
    "\n",
    "## ColumnarDataSource\n",
    "The `RowProcessor` provides the transformation logic that converts data from rows into features, outputs, and metadata, and the remainder of the logic that generates the row and `Example` objects resides in `ColumnarDataSource`. In general you'll use one of it's subclasses: `CSVDataSource`, `JsonDataSource` or `SQLDataSource`, depending on what format the input data is in. If there are other columnar formats that you'd like to see, open an issue on our [GitHub page](https://github.com/oracle/tribuo) and we can discuss how that could fit into Tribuo.\n",
    "\n",
    "In this tutorial we'll process a couple of example files in csv and json formats to see how to programmatically construct a `CSVDataSource` and a `JsonDataSource`, check that the two formats produce the same examples, and then look at how to use the `RowProcessor` at inference time.\n",
    "\n",
    "## Setup\n",
    "First we need to pull in the necessary Tribuo jars. We use the classification jar for the small test model we build at the end, and the json jar for json processing (naturally)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%jars ./tribuo-classification-experiments-4.3.0-jar-with-dependencies.jar\n",
    "%jars ./tribuo-json-4.3.0-jar-with-dependencies.jar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the necessary classes from Tribuo and the JDK:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import java.nio.file.Files;\n",
    "import java.nio.file.Paths;\n",
    "import java.nio.charset.StandardCharsets;\n",
    "import java.util.Locale;\n",
    "import java.util.stream.*;\n",
    "\n",
    "import com.oracle.labs.mlrg.olcut.config.ConfigurationManager;\n",
    "import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;\n",
    "\n",
    "import org.tribuo.*;\n",
    "import org.tribuo.data.columnar.*;\n",
    "import org.tribuo.data.columnar.processors.field.*;\n",
    "import org.tribuo.data.columnar.processors.response.*;\n",
    "import org.tribuo.data.columnar.extractors.*;\n",
    "import org.tribuo.data.csv.CSVDataSource;\n",
    "import org.tribuo.data.text.impl.BasicPipeline;\n",
    "import org.tribuo.json.JsonDataSource;\n",
    "import org.tribuo.classification.*;\n",
    "import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;\n",
    "import org.tribuo.util.tokens.impl.BreakIteratorTokenizer;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading a CSV file\n",
    "Tribuo provides two mechanisms for reading CSV files, `CSVLoader` and `CSVDataSource`. We saw `CSVLoader` in the classification and regression tutorials, and it's designed for loading CSV files which are purely numeric apart from the response column, and when all non-response columns should be used as features. `CSVLoader` is designed to get a simple file off disk and into Tribuo as quickly as possible, but it doesn't capture most uses of CSV files. It is intentionally quite restrictive, which is why `CSVDataSource` exists. `CSVDataSource` in contrast provides full control over how features, outputs and metadata fields are processed and populated from a given CSV file and is the standard way of loading CSV files into Tribuo.\n",
    "\n",
    "Now let's look at the first few rows of the example CSV file we're going to use (note these example files live next to this notebook in Tribuo's tutorials directory):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id,timestamp,example-weight,height,description,transport,disposition,extra-a,extra-b,extra-c\n",
      "1,14/10/2020 16:07,0.50,1.73,\"aged, grey-white hair, blue eyes, stern disposition\",police-box,Good,4.83881,-0.7685,0.87706\n",
      "2,15/10/2020 16:07,0.50,1.73,\"impish, black hair, blue eyes, unkempt\",police-box,Good,1.10026,0.51655,0.9632\n",
      "3,16/10/2020 16:07,1.00,1.9,\"grey curly hair, frilly shirt, cape, blue eyes\",police-box,Good,0.78601,0.49001,3.03926\n",
      "4,17/10/2020 16:07,1.00,1.91,\"brown curly hair, long multicoloured scarf, jelly babies\",police-box,Good,3.56195,2.11667,-0.55366\n"
     ]
    }
   ],
   "source": [
    "var csvPath = Paths.get(\"columnar-data\",\"columnar-example.csv\");\n",
    "var csvLines = Files.readAllLines(csvPath, StandardCharsets.UTF_8);\n",
    "csvLines.stream().limit(5).forEach(System.out::println);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three metadata fields (\"id\", \"timestamp\" and \"example-weight\"), a numerical field (\"height\"), a text field (\"description\"), two categorical fields (\"transport\" and \"disposition\"), and a further three numerical fields (\"extra-a\", \"extra-b\", \"extra-c\"). This is a small generated dataset, without a clear classification boundary, it's just to demonstrate the features of the `RowProcessor`. We're going to pick \"disposition\" as the target field for our classification, with the two possible labels \"Good\" and \"Bad\".\n",
    "\n",
    "We construct the necessary field processors, one that uses the double value directly for height, one which processes the field using a text pipeline emitting bigrams for description, and one which generates a one hot encoded categorical for transport. For more details on the options for processing text you can see the [document classification tutorial](https://github.com/oracle/tribuo/blob/main/tutorials/document-classification-tribuo-v4.ipynb) which discusses Tribuo's built-in text processing options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var textPipeline = new BasicPipeline(new BreakIteratorTokenizer(Locale.US),2);\n",
    "var fieldProcessors = new ArrayList<FieldProcessor>();\n",
    "fieldProcessors.add(new DoubleFieldProcessor(\"height\"));\n",
    "fieldProcessors.add(new TextFieldProcessor(\"description\",textPipeline));\n",
    "fieldProcessors.add(new IdentityProcessor(\"transport\"));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the remaining three fields we use the regular expression matching function to generate the field processors for us. We supply the regex \"extra.\\*\" and the `RowProcessor` will copy the supplied `FieldProcessor` for each field which matches the regex. In this case it will generate three `DoubleFieldProcessors` in total, one each for \"extra-A\", \"extra-B\" and \"extra-C\". Note the field name that's supplied to the `DoubleFieldProcessor` will be ignored when the new processors are generated for each field which matches the regex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "var regexMappingProcessors = new HashMap<String,FieldProcessor>();\n",
    "regexMappingProcessors.put(\"extra.*\", new DoubleFieldProcessor(\"extra.*\"));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we construct the response processor for the \"disposition\" field. As it's a categorical and we're performing classification then the standard `FieldResponseProcessor` will do the trick."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "var responseProcessor = new FieldResponseProcessor(\"disposition\",\"UNK\",new LabelFactory());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we setup the metadata extraction. This step is optional, the row processor ignores fields that don't have a `FieldProcessor` or `ResponseProcessor` mapping, but it's useful to be able to link an example back to the original data when using the predictions downstream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var metadataExtractors = new ArrayList<FieldExtractor<?>>();\n",
    "metadataExtractors.add(new IntExtractor(\"id\"));\n",
    "metadataExtractors.add(new DateExtractor(\"timestamp\",\"timestamp\",\"dd/MM/yyyy HH:mm\")); "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the `DateExtractor` the first \"timestamp\" is the name of the field from which we're extracting, while the second is the name to give the extracted date in the metadata store.\n",
    "\n",
    "We'll also make a weight extractor which reads from the \"example-weight\" field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "var weightExtractor = new FloatExtractor(\"example-weight\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can construct the `RowProcessor`. If you don't want to weight the examples you can set the second argument to null. Similarly we're not doing any feature processing in this example, so we'll supply `Collections.emptySet()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "var rowProcessor = new RowProcessor.Builder<Label>()\n",
    "    .setMetadataExtractors(metadataExtractors)\n",
    "    .setWeightExtractor(weightExtractor)\n",
    "    .setRegexMappingProcessors(regexMappingProcessors)\n",
    "    .setFieldProcessors(fieldProcessors)\n",
    "    .build(responseProcessor);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a row processor built, we can finally construct the `CSVDataSource` to read our file. Note that the `RowProcessor` uses Tribuo's configuration system, so we can construct one from a configuration file to save hard coding the csv (or other format) schema. We can also write out the `RowProcessor` instance we've created into a configuration file for later use. We'll look at this later on when we rebuild this row processor from a trained Model's provenance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of examples = 20\n",
      "Number of features = 148\n",
      "Label domain = [BAD, GOOD]\n"
     ]
    }
   ],
   "source": [
    "var csvSource = new CSVDataSource<Label>(csvPath,rowProcessor,true); \n",
    "// The boolean argument indicates whether the reader should fail if an output value is missing. \n",
    "// Typically it is true at train/test time, but false in deployment/live use when true output values are unknown.\n",
    "\n",
    "var datasetFromCSV = new MutableDataset<Label>(csvSource);\n",
    "\n",
    "System.out.println(\"Number of examples = \" + datasetFromCSV.size());\n",
    "System.out.println(\"Number of features = \" + datasetFromCSV.getFeatureMap().size());\n",
    "System.out.println(\"Label domain = \" + datasetFromCSV.getOutputIDInfo().getDomain());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first example and see what features and metadata are extracted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output = GOOD\n",
      "Metadata = {id=1, timestamp=2020-10-14}\n",
      "Weight = 0.5\n",
      "Features = [(description@1-N=,, 1.000000),(description@1-N=aged, 1.000000),(description@1-N=blue, 1.000000),(description@1-N=disposition, 1.000000),(description@1-N=eyes, 1.000000),(description@1-N=grey-white, 1.000000),(description@1-N=hair, 1.000000),(description@1-N=stern, 1.000000),(description@2-N=,/blue, 1.000000),(description@2-N=,/grey-white, 1.000000),(description@2-N=,/stern, 1.000000),(description@2-N=aged/,, 1.000000),(description@2-N=blue/eyes, 1.000000),(description@2-N=eyes/,, 1.000000),(description@2-N=grey-white/hair, 1.000000),(description@2-N=hair/,, 1.000000),(description@2-N=stern/disposition, 1.000000),(extra-a@value, 4.838810),(extra-b@value, -0.768500),(extra-c@value, 0.877060),(height@value, 1.730000),(transport@police-box, 1.000000)]\n"
     ]
    }
   ],
   "source": [
    "public void printExample(Example<Label> e) {\n",
    "    System.out.println(\"Output = \" + e.getOutput().toString());\n",
    "    System.out.println(\"Metadata = \" + e.getMetadata());\n",
    "    System.out.println(\"Weight = \" + e.getWeight());\n",
    "    System.out.println(\"Features = [\" + StreamSupport.stream(e.spliterator(), false).map(Feature::toString).collect(Collectors.joining(\",\")) + \"]\");\n",
    "}\n",
    "printExample(datasetFromCSV.getExample(0));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the output label is `GOOD` and the two metadata fields have been populated, one for the `id` and one for `timestamp`. We extracted a weight of 0.5, as by default it is 1.0. Next come the text unigrams and bigrams extracted from the `description` field. Unigrams are named `description@1-N=<token>` and bigrams are named `description@2-N=<token>,<token>`, and the value is the number of times that unigram or bigram occurred in the text. After the text features comes the three features extracted via the regex expansion, `extra-a`, `extra-b` and `extra-c`, each with the floating point value. Then comes `height` with the floating point value extracted, and finally `transport` extracted as a one-hot categorical feature.  That is, an example can only have either `transport@police-box` or `transport@starship` as a feature with the value `1.0` in any given example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading a JSON file\n",
    "\n",
    "Tribuo's `JsonDataSource` supports reading flat json objects from a json array, which is fairly restrictive. JSON is such a flexible format it's hard to build parsers for everything, but the `JsonDataSource` should be a good place to start looking if you need to write something more complicated.\n",
    "\n",
    "We'll use a JSON version of the CSV file from above, and again we'll print the first few lines from the JSON file to show the format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      " {\n",
      "   \"id\": 1,\n",
      "   \"timestamp\": \"14/10/2020 16:07\",\n",
      "   \"example-weight\": 0.5,\n",
      "   \"height\": 1.73,\n",
      "   \"description\": \"aged, grey-white hair, blue eyes, stern disposition\",\n",
      "   \"transport\": \"police-box\",\n",
      "   \"disposition\": \"Good\",\n",
      "   \"extra-a\": 4.83881,\n",
      "   \"extra-b\": -0.7685,\n",
      "   \"extra-c\": 0.87706\n",
      " },\n",
      " {\n"
     ]
    }
   ],
   "source": [
    "var jsonPath = Paths.get(\"columnar-data\",\"columnar-example.json\");\n",
    "var jsonLines = Files.readAllLines(jsonPath, StandardCharsets.UTF_8);\n",
    "jsonLines.stream().limit(14).forEach(System.out::println);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can re-use the `RowProcessor` from earlier, as it doesn't know anything about the serialized format of the data, and supply it to the `JsonDataSource` constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of examples = 20\n",
      "Number of features = 148\n",
      "Label domain = [BAD, GOOD]\n"
     ]
    }
   ],
   "source": [
    "var jsonSource = new JsonDataSource<>(jsonPath,rowProcessor,true);\n",
    "\n",
    "var datasetFromJson = new MutableDataset<Label>(jsonSource);\n",
    "\n",
    "System.out.println(\"Number of examples = \" + datasetFromJson.size());\n",
    "System.out.println(\"Number of features = \" + datasetFromJson.getFeatureMap().size());\n",
    "System.out.println(\"Label domain = \" + datasetFromJson.getOutputIDInfo().getDomain());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the CSV file and the JSON file contain the same data, we should get the same examples out, in the same order. Note the DataSource provenances will not be the same (as the hashes, timestamps and file paths are different) so the datasets themselves won't be equal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "isEqual = true\n"
     ]
    }
   ],
   "source": [
    "boolean isEqual = true;\n",
    "for (int i = 0; i < datasetFromJson.size(); i++) {\n",
    "    boolean equals = datasetFromJson.getExample(i).equals(datasetFromCSV.getExample(i));\n",
    "    if (!equals) {\n",
    "        System.out.println(\"Example \" + i + \" not equal\");\n",
    "        System.out.println(\"JSON - \" + datasetFromJson.getExample(i).toString());\n",
    "        System.out.println(\"CSV - \" + datasetFromCSV.getExample(i).toString());\n",
    "    }\n",
    "    isEqual &= equals;\n",
    "}\n",
    "System.out.println(\"isEqual = \" + isEqual);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're going to train a simple logistic regression model, to show how to rebuild the RowProcessor from the model's provenance object (which allows you to rebuild the data ingest pipeline from the model itself).\n",
    "\n",
    "First we train the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "var model = new LogisticRegressionTrainer().train(datasetFromJson);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we extract the dataset provenance and convert it into a configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "var dataProvenance = model.getProvenance().getDatasetProvenance();\n",
    "var provConfig = ProvenanceUtil.extractConfiguration(dataProvenance);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we feed the configuration to a `ConfigurationManager` so we can rebuild the data ingest pipeline used at training time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "var cm = new ConfigurationManager();\n",
    "cm.addConfiguration(provConfig);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the `ConfigurationManager` contains all the configuration necessary to rebuild the `DataSource` we used to build the model. However all we want is the `RowProcessor` instance, as the `JsonDataSource` itself isn't particularly useful at inference time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "RowProcessor<Label> newRowProcessor = (RowProcessor<Label>) cm.lookup(\"rowprocessor-1\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This row processor has the original regexes inside it rather than the concretely expanded `FieldProcessor`s bound to each field that matched the regex, so first we need to expand the row processor with the headers from the original `DataSource` (or our inference time data). Then we'll pass it another row and look at the Example produced to check that everything is working. As this is a test time example, we don't have a ground truth output so we pass in `false` for the boolean `outputRequired` argument to `RowProcessor.generateExample`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output = UNK\n",
      "Metadata = {id=21, timestamp=2020-11-03}\n",
      "Weight = 1.0\n",
      "Features = [(description@1-N=,, 1.000000),(description@1-N=brown, 1.000000),(description@1-N=goatee, 1.000000),(description@1-N=grey, 1.000000),(description@1-N=hair, 1.000000),(description@1-N=leather, 1.000000),(description@1-N=trenchcoat, 1.000000),(description@2-N=,/grey, 1.000000),(description@2-N=brown/leather, 1.000000),(description@2-N=grey/goatee, 1.000000),(description@2-N=grey/hair, 1.000000),(description@2-N=hair/,, 1.000000),(description@2-N=leather/trenchcoat, 1.000000),(description@2-N=trenchcoat/,, 1.000000),(extra-a@value, 0.817540),(extra-b@value, 2.561580),(extra-c@value, -1.216360),(height@value, 1.750000),(transport@police-box, 1.000000)]\n"
     ]
    }
   ],
   "source": [
    "Map<String,String> newRow = Map.of(\"id\",\"21\",\"timestamp\",\"03/11/2020 16:07\",\"height\",\"1.75\",\"description\",\"brown leather trenchcoat, grey hair, grey goatee\",\"transport\",\"police-box\",\"extra-a\",\"0.81754\",\"extra-b\",\"2.56158\",\"extra-c\",\"-1.21636\");\n",
    "var headers = Collections.unmodifiableList(new ArrayList<>(newRow.keySet()));\n",
    "var row = new ColumnarIterator.Row(21,headers,newRow);\n",
    "newRowProcessor.expandRegexMapping(headers);\n",
    "Example<Label> testExample = newRowProcessor.generateExample(row,false).get();\n",
    "\n",
    "printExample(testExample);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the metadata and features have been extracted as before. We didn't supply an \"example-weight\" field so the weight is set to the default value of 1.0. As there was no `disposition` field, we can see the output has been set to the sentinel unknown output, shown here as `UNK`. But we can ask our simple linear model what the disposition for this example should be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Prediction(maxLabel=(BAD,0.96797245146932),outputScores={BAD=(BAD,0.96797245146932),GOOD=(GOOD,0.032027548530680135)})"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var prediction = model.predict(testExample);\n",
    "prediction.toString();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It appears that our model thinks this example is `BAD`, though personally I'm not so sure that's the right label. Either way, we managed to produce a test time example using only information encoded in our model, so our ETL pipeline is stored safely inside the model, ready whenever we need it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "We've used Tribuo's columnar data infrastructure to process two different kinds of columnar input, csv files and json files. We saw how the central part of the columnar infrastructure, the `RowProcessor`, can be configured to extract different kinds of features, metadata and outputs, and how it is stored along with the rest of the training metadata in a trained model's provenance object. Finally we saw how to extract the `RowProcessor` from the model provenance and use it to generate an example at inference time to replicate the input processing."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Java",
   "language": "java",
   "name": "java"
  },
  "language_info": {
   "codemirror_mode": "java",
   "file_extension": ".jshell",
   "mimetype": "text/x-java-source",
   "name": "Java",
   "pygments_lexer": "java",
   "version": "12+33"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
